9/16/2020

the Grammar of Graphics beyond ggplot2

  • The grammar of graphics outlined by ggplot2 provides a uniform way to both make plots, and to write the code for making other plotting libraries.
  • Because of this, there is a wide range of other packages that additional functionality to ggplot, while following the same standard to allow for easy use and interoperability

installing external libraries

  • recall that we can use install.packges to download packages from CRAN. However, the requirements for CRAN can be a little restrictive, and as new packages are developed, there are often long delays between uploads to CRAN.
  • almost all package developers use GitHub to host their data
  • the devtools package provides an easy way to download such packages
remove.packages('RBedtools')
## Removing package from '/Users/swamyvs/Library/R/4.0/library'
## (as 'lib' is unspecified)
devtools::install_github('vinay-swamy/RBedtools', upgrade = F)
## Using github PAT from envvar GITHUB_PAT
## Downloading GitHub repo vinay-swamy/RBedtools@HEAD
##   
   checking for file ‘/private/var/folders/43/bxszl5k54ml3ljc8rqdcc7vchdgq7y/T/RtmpSmVzsB/remotesa7e01b1fc3e4/vinay-swamy-RBedtools-e155d87/DESCRIPTION’ ...
  
✓  checking for file ‘/private/var/folders/43/bxszl5k54ml3ljc8rqdcc7vchdgq7y/T/RtmpSmVzsB/remotesa7e01b1fc3e4/vinay-swamy-RBedtools-e155d87/DESCRIPTION’
## 
  
─  preparing ‘RBedtools’:
## 
  
   checking DESCRIPTION meta-information ...
  
✓  checking DESCRIPTION meta-information
## 
  
─  checking for LF line-endings in source and make files and shell scripts
## 
  
─  checking for empty or unneeded directories
## 
  
─  building ‘RBedtools_0.2.0.tar.gz’
## 
  
   Warning: invalid uid value replaced by that for user 'nobody'
##    Warning: invalid gid value replaced by that for user 'nobody'
## 
  
   
## 
## Installing package into '/Users/swamyvs/Library/R/4.0/library'
## (as 'lib' is unspecified)

the ggpubr library

  • the ggpubr is built on top of ggplot, but provides a more streamlined interface to use. It was designed to be used quickly and easily for making scientific plots for users with little to know programming experience.
  • this is the code to make a violin plot in ggpubr - one function call to make a single plot
library(ggpubr)
ggviolin(ToothGrowth, x = "dose", y = "len", fill = "dose",
         palette = c("#00AFBB", "#E7B800", "#FC4E07"),
         add = "boxplot", add.params = list(fill = "white"))

the ggpubr library

ggviolin(ToothGrowth, x = "dose", y = "len", fill = "dose",
         palette = c("#00AFBB", "#E7B800", "#FC4E07"),
         add = "boxplot", add.params = list(fill = "white"))

  • all the aesthetics you would need to change are specified as part of the function

the ggpubr library

  • testing for statistical significance is also easy to add to the plot
ggviolin(ToothGrowth, x = "dose", y = "len", fill = "dose",
         palette = c("#00AFBB", "#E7B800", "#FC4E07"),
         add = "boxplot", add.params = list(fill = "white"))+  
  stat_compare_means(comparisons = list( c("0.5", "1"),  c("0.5", "2") ), 
                     label = "p.signif")+
  stat_compare_means(label.y = 50) 

An aside

  • My biggest qualm with ggpubr is that it makes it too easy to apply a statistical analysis without really understanding what’s being used.

the ggpubr library

  • ggpubr will can also apply transformations to data for you
  • For example, summarizing a range of point to there average value
ggline(ToothGrowth, "dose", "len",
   linetype = "supp", shape = "supp",
   color = "supp", palette = c("#00AFBB", "#E7B800"), add='mean')

the ggpubr library

  • customizing plots is a lot easier as well
ggline(ToothGrowth, "dose", "len",
   linetype = "supp", shape = "supp",
   color = "supp", palette = c("#00AFBB", "#E7B800"), add='mean_se')

  • While ggpubr excels for making quick and descriptive plots, it lacks a lot of flexibility. Use it when when you need something out of the box, but if you need something a little more complicated, stick with ggplot

adding patterns to plots

  • Using patterns has a long history in scientific plots.
  • patterns provide another “dimension” to showcase a categorical variable
  • patterns are available through the ggpattern package
library(ggpattern)
df <- data.frame(level = c("a", "b", "c", 'd'), outcome = c(2.3, 1.9, 3.2, 1))

ggplot(df) +
  geom_col_pattern(
    aes(level, outcome, pattern_fill = level), 
    pattern = 'stripe',
    fill    = 'white',
    colour  = 'black'
  ) +
  theme_bw(18) +
  theme(legend.position = 'none') + 
  labs(
    title    = "ggpattern::geom_pattern_col()",
    subtitle = "pattern = 'stripe'"
  ) +
  coord_fixed(ratio = 1/2)

adding patterns to plots

  • with a little bit more code, multiple patterns can be added
iris_dfac <-  iris %>% 
  mutate(dfac = sample(c('long', 'short'),size = nrow(.), replace = T)) %>% 
  group_by(Species, dfac) %>% 
    summarise(dcol = n())
## `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
ggplot(iris_dfac)+ 
    geom_col_pattern(aes(x=Species, y=dcol, pattern_density = dfac, fill = Species),
                     pattern_color = 'white',
                     color='black',
                     position = 'dodge', pattern_fill = 'white', 
                     pattern= 'crosshatch')  + 
    theme_classic()

lets do some examples

external themes

  • While the base ggplot themes are nice, there are many published themes and palettes that you can use with your plots
  • We’ll cover two packages,ggthemes and ggsci
library(ggplot2)
library(ggthemes)
ggplot(iris) + 
  geom_boxplot(aes(x= Species, y= Sepal.Length, fill = Species)) + 
  theme_minimal()

external themes

  • a theme mimicking the Wall Street Journal
ggplot(iris) + 
  geom_boxplot(aes(x= Species, y= Sepal.Length, fill = Species)) + 
  ggthemes::theme_wsj()

ggsci

  • many journals and magazines have specific color palettes they use for their publications. the ggsci provides close approximation for several of these
library("ggsci")
data(diamonds)
diamonds <- diamonds %>% mutate(cut = as.character(cut))
p1 = ggplot(subset(diamonds, carat >= 2.2),
       aes(x = table, y = price, colour = cut)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "loess", alpha = 0.05, size = 1, span = 1) +
  theme_bw()
p1
## `geom_smooth()` using formula 'y ~ x'

ggsci

  • This is a theme popular in Nature publications
p1 + scale_color_npg()  
## `geom_smooth()` using formula 'y ~ x'

ggsci

  • This is a theme popular in Science publications
p1 + scale_color_aaas()
## `geom_smooth()` using formula 'y ~ x'

- and there are many more

animated plots

  • animated plots are all the rage these days, and thanks to the gganimate package, making animated plots is super easy
knitr::include_graphics('../src/economic_lg_ani.gif')

animated plots

  • gganimate works by having transition_... functions that can change the color or positions of geoms based on time or another continuous variable

  • the code to make the animated plot

library(gganimate)
data("economics_long")
ani_plot <- ggplot(economics_long)  +
  geom_line(aes(x=date, y=value01, color = variable)) + 
  theme_bw()+
  transition_reveal(date)
#anim_save('../src/economic_lg_ani.gif', ani_plot)

lets do some examples

Plotting Maps in R

  • R was well known in its early years for having good libraries for making plots with maps.
  • While not too common in biological sciences, map based plots are common in fields like public health and ecology
  • The libraries for making maps are notoriously difficult to install, especially on a Mac. We’ll go through how to install its later, but it will probably be easier if you use the binder link in the repo

the sf package

  • sf stands for “simple features”, and is a way for representing the physical diagram of a map as a dataframe. This format is not unique to R, and is used by many different programs
  • sf is the successor to the sp package, which was what gave maps in R its fame

the sf package

  • This is some example data provided in the sf package. It contains geospatial information about the state of North Carolina. you can recognize geospatial data by the .shp extension. the st_read function allows you to read .shp data into R. Within R, it exists as a custom class, also called `sf
library(sf)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
## Reading layer `nc' from data source 
##   `/Users/swamyvs/Library/R/4.0/library/sf/shape/nc.shp' using driver `ESRI Shapefile'
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
nc
## Simple feature collection with 100 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
## Geodetic CRS:  NAD27
## First 10 features:
##     AREA PERIMETER CNTY_ CNTY_ID        NAME  FIPS FIPSNO CRESS_ID BIR74 SID74
## 1  0.114     1.442  1825    1825        Ashe 37009  37009        5  1091     1
## 2  0.061     1.231  1827    1827   Alleghany 37005  37005        3   487     0
## 3  0.143     1.630  1828    1828       Surry 37171  37171       86  3188     5
## 4  0.070     2.968  1831    1831   Currituck 37053  37053       27   508     1
## 5  0.153     2.206  1832    1832 Northampton 37131  37131       66  1421     9
## 6  0.097     1.670  1833    1833    Hertford 37091  37091       46  1452     7
## 7  0.062     1.547  1834    1834      Camden 37029  37029       15   286     0
## 8  0.091     1.284  1835    1835       Gates 37073  37073       37   420     0
## 9  0.118     1.421  1836    1836      Warren 37185  37185       93   968     4
## 10 0.124     1.428  1837    1837      Stokes 37169  37169       85  1612     1
##    NWBIR74 BIR79 SID79 NWBIR79                       geometry
## 1       10  1364     0      19 MULTIPOLYGON (((-81.47276 3...
## 2       10   542     3      12 MULTIPOLYGON (((-81.23989 3...
## 3      208  3616     6     260 MULTIPOLYGON (((-80.45634 3...
## 4      123   830     2     145 MULTIPOLYGON (((-76.00897 3...
## 5     1066  1606     3    1197 MULTIPOLYGON (((-77.21767 3...
## 6      954  1838     5    1237 MULTIPOLYGON (((-76.74506 3...
## 7      115   350     2     139 MULTIPOLYGON (((-76.00897 3...
## 8      254   594     2     371 MULTIPOLYGON (((-76.56251 3...
## 9      748  1190     2     844 MULTIPOLYGON (((-78.30876 3...
## 10     160  2038     5     176 MULTIPOLYGON (((-80.02567 3...
class(nc)
## [1] "sf"         "data.frame"

the sf package

  • the most important part of an sf object is the geometry column; this contains the actual information about how to draw each map

  • sf objects can be easily coerced to a tibble

nc_table <- nc %>% as_tibble
nc_table
## # A tibble: 100 × 15
##     AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74 NWBIR74
##    <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>   <dbl>
##  1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1      10
##  2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0      10
##  3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5     208
##  4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1     123
##  5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9    1066
##  6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7     954
##  7 0.062      1.55  1834    1834 Camd… 37029  37029       15   286     0     115
##  8 0.091      1.28  1835    1835 Gates 37073  37073       37   420     0     254
##  9 0.118      1.42  1836    1836 Warr… 37185  37185       93   968     4     748
## 10 0.124      1.43  1837    1837 Stok… 37169  37169       85  1612     1     160
## # … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
## #   NWBIR79 <dbl>, geometry <MULTIPOLYGON [°]>

the sf package

-sf was designed to be used with base R’s plotting interface

plot(nc)

the sf package

  • but since plotting in base R isn’t great, we’ll be combining sf with another package, ggspatial.
  • when using geom_sf, we can directly pass sf objects. When doing this, aesthetic mappings are automatically generated

the ggspatial package

library(ggspatial)
ggplot(data = nc ) +
    geom_sf() + 
    theme_bw()

the ggspatial package

  • when using using a tibble as input, manually specify the aesthetic mapping
ggplot(nc_table) + 
    geom_sf(aes( geometry= geometry)) + 
    theme_bw()

the ggspatial package

  • from here, we can treat the plot like a normal ggplot, changing aesthetics as we please
ggplot(nc_table) + 
    geom_sf(aes( geometry= geometry, fill = SID74)) + 
    theme_bw()

Using Maps with real data

  • Almost always, real data does not come with geometry coordinates attached. if you’re lucky, you’ll get tabular data that has some kind of easy geographical identifier, like state, country, or zip code.
  • Here’s an (slightly morbid) example - Violent Crime Arrests in the US
data("USArrests")
USArrests
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6

the sf package

  • the general workflow for plotting maps is taking an sf object that has defined geometry, and then connecting the sf data to our tabular data.

  • there are multiple packages for obtaining map data, but its often not in sf format. luckily, the sf package provides the function st_as_sf to convert other formats to sf format

the sf package

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(sf)

states <- maps::map("state", plot = FALSE, fill = TRUE) %>% st_as_sf() %>% 
    as_tibble %>% 
    rename(state = ID, 
           geometry = geom)


states
## # A tibble: 49 × 2
##    state                                                                geometry
##    <chr>                                                      <MULTIPOLYGON [°]>
##  1 alabama          (((-87.46201 30.38968, -87.48493 30.37249, -87.52503 30.372…
##  2 arizona          (((-114.6374 35.01918, -114.6431 35.10512, -114.603 35.1223…
##  3 arkansas         (((-94.05103 33.03675, -94.05103 33.30031, -94.05676 33.575…
##  4 california       (((-120.006 42.00927, -120.006 41.20139, -120.006 39.70024,…
##  5 colorado         (((-102.0552 40.00964, -102.061 40.00391, -102.0552 39.5799…
##  6 connecticut      (((-73.49902 42.04937, -73.04066 42.04364, -73.01201 42.043…
##  7 delaware         (((-75.80231 39.72889, -75.76221 39.72889, -75.74503 39.757…
##  8 district of col… (((-77.13731 38.94394, -77.06283 38.99551, -77.01699 38.972…
##  9 florida          (((-85.01548 30.99702, -84.99829 30.96264, -84.97537 30.922…
## 10 georgia          (((-80.89018 32.0398, -80.85007 32.02834, -80.84435 32.0168…
## # … with 39 more rows

the sf package

arrests <- USArrests %>% rownames_to_column('state')
arrests
##             state Murder Assault UrbanPop Rape
## 1         Alabama   13.2     236       58 21.2
## 2          Alaska   10.0     263       48 44.5
## 3         Arizona    8.1     294       80 31.0
## 4        Arkansas    8.8     190       50 19.5
## 5      California    9.0     276       91 40.6
## 6        Colorado    7.9     204       78 38.7
## 7     Connecticut    3.3     110       77 11.1
## 8        Delaware    5.9     238       72 15.8
## 9         Florida   15.4     335       80 31.9
## 10        Georgia   17.4     211       60 25.8
## 11         Hawaii    5.3      46       83 20.2
## 12          Idaho    2.6     120       54 14.2
## 13       Illinois   10.4     249       83 24.0
## 14        Indiana    7.2     113       65 21.0
## 15           Iowa    2.2      56       57 11.3
## 16         Kansas    6.0     115       66 18.0
## 17       Kentucky    9.7     109       52 16.3
## 18      Louisiana   15.4     249       66 22.2
## 19          Maine    2.1      83       51  7.8
## 20       Maryland   11.3     300       67 27.8
## 21  Massachusetts    4.4     149       85 16.3
## 22       Michigan   12.1     255       74 35.1
## 23      Minnesota    2.7      72       66 14.9
## 24    Mississippi   16.1     259       44 17.1
## 25       Missouri    9.0     178       70 28.2
## 26        Montana    6.0     109       53 16.4
## 27       Nebraska    4.3     102       62 16.5
## 28         Nevada   12.2     252       81 46.0
## 29  New Hampshire    2.1      57       56  9.5
## 30     New Jersey    7.4     159       89 18.8
## 31     New Mexico   11.4     285       70 32.1
## 32       New York   11.1     254       86 26.1
## 33 North Carolina   13.0     337       45 16.1
## 34   North Dakota    0.8      45       44  7.3
## 35           Ohio    7.3     120       75 21.4
## 36       Oklahoma    6.6     151       68 20.0
## 37         Oregon    4.9     159       67 29.3
## 38   Pennsylvania    6.3     106       72 14.9
## 39   Rhode Island    3.4     174       87  8.3
## 40 South Carolina   14.4     279       48 22.5
## 41   South Dakota    3.8      86       45 12.8
## 42      Tennessee   13.2     188       59 26.9
## 43          Texas   12.7     201       80 25.5
## 44           Utah    3.2     120       80 22.9
## 45        Vermont    2.2      48       32 11.2
## 46       Virginia    8.5     156       63 20.7
## 47     Washington    4.0     145       73 26.2
## 48  West Virginia    5.7      81       39  9.3
## 49      Wisconsin    2.6      53       66 10.8
## 50        Wyoming    6.8     161       60 15.6

the sf package

inner_join(arrests, states)
## Joining, by = "state"
## [1] state    Murder   Assault  UrbanPop Rape     geometry
## <0 rows> (or 0-length row.names)
  • What happened?

always check your joins!

  • in this case the cases in the state column doesn’t match
states <- states %>% 
    mutate(state = tolower(state))

arrestData_with_mapData <- arrests %>%
    mutate(state = tolower(state)) %>% 
    inner_join( states)
## Joining, by = "state"
ggplot(arrestData_with_mapData) + 
    geom_sf(aes(geometry = geometry, fill = Assault)) + 
    theme_bw()

cleaned up map

  • lets pretty it up a little
ggplot(arrestData_with_mapData) + 
    geom_sf(aes(geometry = geometry, fill = Assault)) + 
    scale_fill_viridis_c('Hundreds of Persons') + 
    ggtitle('Number of Assaults per state')+
    theme_void()

using coordinate systems

  • sometimes you don’t even get a region, and just get latitudes, and longitudes

  • this is some example data about the locations of earthquakes in Fiji

data("quakes")
quakes %>% head
##      lat   long depth mag stations
## 1 -20.42 181.62   562 4.8       41
## 2 -20.62 181.03   650 4.2       15
## 3 -26.00 184.10    42 5.4       43
## 4 -17.97 181.66   626 4.1       19
## 5 -20.42 181.96   649 4.0       11
## 6 -19.68 184.31   195 4.0       12

using coordinate systems

library(rnaturalearth)
library(rnaturalearthdata)
world <- ne_countries(scale = "medium", returnclass = "sf")

quakes_sf <- quakes %>% st_as_sf(coords = c("long", "lat"),crs = 3460 ) %>% as_tibble 
ggplot() +
    geom_sf(data = quakes_sf,  aes(geometry = geometry), alpha = .1)+
    geom_sf(data = world, color = 'red') +
    coord_sf( ylim = c(-50, -10),  xlim = c(165, 200)) + 
    theme_bw()

lets do some practice